[megatron] support multimodal model CPT/SFT/DPO (Full/LoRA) #5502

Jintao-Huang · 2025-08-22T15:57:05Z

No description provided.

gemini-code-assist

Summary of Changes

Hello @Jintao-Huang, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces initial support for multimodal models within the Megatron framework. The primary goal is to enable the integration of vision components, specifically for models like Qwen2.5VL, and to ensure that model configurations are correctly handled during the conversion process from HuggingFace formats to Megatron. This lays the groundwork for training and deploying models that can process both text and visual data.

Highlights

Multimodal Model Integration: Introduces a new Qwen2_5VL_Vit module to support Vision Transformer (ViT) components for multimodal models within the Megatron framework.
Configuration Handling for Multimodal Models: Enhances the model conversion utility (convert.py) to correctly extract and use text_config for multimodal models when converting HuggingFace configurations to Megatron.
Improved Argument Passing: Ensures essential model metadata (model_info and model_meta) is passed as extra arguments during Megatron initialization, facilitating proper setup for multimodal architectures.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in issue comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request adds initial support for multimodal models within Megatron. The changes introduce a new Vision Transformer (ViT) model wrapper for Qwen2.5-VL and update the model conversion script to correctly handle multimodal configurations. My review focuses on the new ViT wrapper, where I've identified a couple of critical issues: a missing import that will cause a NameError, and an incorrect implementation of the get_input_embeds method that will lead to a runtime error. I've provided specific suggestions to address these problems. The other changes appear logical for the goal of adding multimodal capabilities.

Jintao-Huang · 2025-08-31T15:46:04Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces support for multimodal models within the Megatron framework, with a specific implementation for Qwen2.5-VL. The changes are extensive, touching upon argument handling, model architecture, weight conversion utilities, and the training process to accommodate visual components. Key additions include a Qwen2_5VL_Vit module for the visual tower, modifications to GPTModel to integrate visual embeddings, and updates to the conversion scripts to handle multimodal weights. My review identifies a critical bug in the initialization of the visual processor and a high-severity issue in how positional embedding types are checked, which could lead to incorrect behavior.

gemini-code-assist · 2025-08-31T15:47:55Z

swift/megatron/model/qwen2_5_vl/vit.py

+        model_dir = args.model_info.model_dir
+        kwargs = {'attn_impl': 'flash_attn'} if args.attention_backend.name == 'flash' else {}
+        with patch_device_map_meta(Qwen2_5_VLTextModel):
+            model, _ = get_model_tokenizer(model_dir, args.torch_dtype, return_dummy_model=True, **kwargs)


The processor is not stored after being loaded from get_model_tokenizer, but it's used later in get_inputs_embeds (e.g., self.processor.image_processor). This will raise an AttributeError at runtime. You should store the returned processor in self.processor.

Suggested change

model, _ = get_model_tokenizer(model_dir, args.torch_dtype, return_dummy_model=True, **kwargs)

model, self.processor = get_model_tokenizer(model_dir, args.torch_dtype, return_dummy_model=True, **kwargs)

gemini-code-assist · 2025-08-31T15:47:55Z

swift/megatron/model/gpt_model.py

                    rotary_seq_len,
                    packed_seq=packed_seq_params is not None and packed_seq_params.qkv_format == 'thd',
                )
+        elif self.position_embedding_type in 'mrope':


The condition self.position_embedding_type in 'mrope' is likely a bug. It checks if self.position_embedding_type is one of the characters 'm', 'r', 'o', 'p', 'e', not the string 'mrope'. This works by coincidence for now because no other position embedding types share these characters, but it's fragile and incorrect. It should be changed to a direct string comparison for correctness and clarity.

Suggested change

elif self.position_embedding_type in 'mrope':

elif self.position_embedding_type == 'mrope':

Jintao-Huang · 2025-09-01T13:15:04Z

/gemini review

gemini-code-assist

Code Review

This PR adds support for multimodal model SFT in Megatron, including new arguments, model classes, and training logic. The changes are extensive and well-structured. However, I've identified a critical issue related to the initialization order of Megatron components. Specifically, patch_megatron_tokenizer is called before Megatron is initialized, which will cause a runtime error. This has a cascading effect, including an uninitialized processor in the new Qwen2_5VL_Vit class. I've also found a minor typo in an error message. Addressing the initialization order is crucial for the new functionality to work.

gemini-code-assist · 2025-09-01T13:18:28Z

swift/megatron/train/sft.py

+        with torch.device('meta'):
+            self.model, self.processor = args.get_model_processor(**kwargs)
+        self._prepare_template()
        patch_megatron_tokenizer(self.processor)


The function patch_megatron_tokenizer calls megatron.training.get_args(), which requires Megatron to be initialized (i.e., initialize_megatron must have been called). However, patch_megatron_tokenizer is called here before self.trainer.train(), which is where initialize_megatron is eventually called. This will lead to a runtime error because megatron.training.get_args() will fail. The call to patch_megatron_tokenizer should be moved to a point after initialize_megatron has been executed, for example, within the MegatronTrainer after initialization.

gemini-code-assist · 2025-09-01T13:18:28Z

swift/megatron/model/mm_gpt/qwen2_5_vl.py

+    def __init__(self, config):
+        from transformers.models.qwen2_5_vl import Qwen2_5_VLTextModel
+        super().__init__(config)
+        args = get_args()
+        model_dir = args.model_info.model_dir
+        kwargs = {'attn_impl': 'flash_attn'} if args.attention_backend.name == 'flash' else {}
+        with patch_device_map_meta(Qwen2_5_VLTextModel):
+            model, _ = get_model_tokenizer(model_dir, args.torch_dtype, return_dummy_model=True, **kwargs)
+        self.model = model.visual.to('cuda')
+        self.model_config = model.config


The method get_inputs_embeds uses self.processor, but it is not initialized in the Qwen2_5VL_Vit class. This will cause an AttributeError at runtime. The processor should be made available to this class. Assuming the related initialization order issue is fixed and patch_megatron_tokenizer correctly sets args.tokenizer on the global Megatron arguments, you should initialize self.processor in the __init__ method.

Suggested change

def __init__(self, config):

from transformers.models.qwen2_5_vl import Qwen2_5_VLTextModel

super().__init__(config)

args = get_args()

model_dir = args.model_info.model_dir

kwargs = {'attn_impl': 'flash_attn'} if args.attention_backend.name == 'flash' else {}

with patch_device_map_meta(Qwen2_5_VLTextModel):

model, _ = get_model_tokenizer(model_dir, args.torch_dtype, return_dummy_model=True, **kwargs)

self.model = model.visual.to('cuda')

self.model_config = model.config

def __init__(self, config):

from transformers.models.qwen2_5_vl import Qwen2_5_VLTextModel

super().__init__(config)

args = get_args()

self.processor = args.tokenizer

model_dir = args.model_info.model_dir

kwargs = {'attn_impl': 'flash_attn'} if args.attention_backend.name == 'flash' else {}

with patch_device_map_meta(Qwen2_5_VLTextModel):

model, _ = get_model_tokenizer(model_dir, args.torch_dtype, return_dummy_model=True, **kwargs)

self.model = model.visual.to('cuda')

self.model_config = model.config

gemini-code-assist · 2025-09-01T13:18:29Z

swift/megatron/model/gpt_model.py

+                raise NotImplementedError('Flash decoding uses precomputed cos and sin for RoPE, not implmented in '
+                                          'MultimodalRotaryEmbedding yet.')


There is a typo in the error message. "implmented" should be "implemented".

Suggested change

raise NotImplementedError('Flash decoding uses precomputed cos and sin for RoPE, not implmented in '

'MultimodalRotaryEmbedding yet.')

raise NotImplementedError('Flash decoding uses precomputed cos and sin for RoPE, not implemented in '

'MultimodalRotaryEmbedding yet.')

Jintao-Huang · 2025-09-02T09:52:29Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces significant enhancements to support Supervised Fine-Tuning (SFT) for multimodal models within the Megatron framework, covering both full-parameter and LoRA tuning. Key additions include new arguments for controlling component freezing (LLM, ViT, aligner), a new MultimodalGPTModel class to handle multimodal inputs, and the implementation of Multimodal RoPE (mrope). The changes span across model definitions, data processing pipelines, training arguments, and documentation to provide comprehensive support for this new feature. The code is well-structured, and the changes are consistent. I have one suggestion to improve device handling for better portability.

gemini-code-assist · 2025-09-02T09:55:10Z

swift/megatron/model/mm_gpt/qwen2_5_vl.py

+        kwargs = {'attn_impl': 'flash_attn'} if args.attention_backend.name == 'flash' else {}
+        with patch_device_map_meta(Qwen2_5_VLTextModel):
+            model, _ = get_model_tokenizer(model_dir, args.torch_dtype, return_dummy_model=True, **kwargs)
+        self.model = model.visual.to('cuda')


The device is hardcoded to 'cuda'. This could cause issues if running in an environment without CUDA or on a specific CUDA device in a multi-GPU setup. It's better to use a more flexible way to specify the device, for example by using torch.cuda.current_device() which respects the device set for the current process by Megatron's initialization.

Suggested change

self.model = model.visual.to('cuda')

self.model = model.visual.to(torch.cuda.current_device())

Jintao-Huang · 2025-09-02T12:20:30Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces significant new functionality by adding support for Supervised Fine-Tuning (SFT) of multimodal models using Megatron, for both full-parameter and LoRA tuning. The changes are comprehensive, touching upon model definitions, training arguments, data processing, and documentation. Key additions include the MultimodalGPTModel class, which elegantly wraps the existing GPTModel to incorporate visual components, and new arguments like freeze_llm, freeze_vit, and freeze_aligner to control training. The implementation for mRoPE support and updates to context parallelism for multimodal data are also notable. The documentation has been updated with new tutorials and parameter descriptions, which is very helpful. My review found a couple of minor inconsistencies in the new documentation examples, which I've commented on. Overall, this is a well-executed and valuable addition to the framework.

gemini-code-assist · 2025-09-02T12:22:51Z

docs/source/Megatron-SWIFT/多模态模型.md

In the inference script example, the model path appears to be incorrect. Based on the context, the model should be Qwen2.5-VL-7B-Instruct, but it's written as Qwen2.5-7B-Instruct, missing the VL part. This could be confusing for users.

Suggested change

--model megatron_output/Qwen2.5-7B-Instruct/vx-xxx-hf \

--model megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx-hf \

gemini-code-assist · 2025-09-02T12:22:51Z

docs/source_en/Megatron-SWIFT/Multimodal-Model.md

+MAX_PIXELS=1003520 \
+CUDA_VISIBLE_DEVICES=0 \
+swift infer \
+    --model megatron_output/Qwen2.5-7B-Instruct/vx-xxx-hf \


There seems to be a typo in the model path in the inference script example. Based on the context, the model should be Qwen2.5-VL-7B-Instruct, but it's written as Qwen2.5-7B-Instruct here, missing the VL part. This could be confusing for users.

Suggested change

--model megatron_output/Qwen2.5-7B-Instruct/vx-xxx-hf \

--model megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx-hf \

gemini-code-assist

Code Review

This pull request introduces significant new functionality by adding support for Supervised Fine-Tuning (SFT) of multimodal models using Megatron, for both full-parameter and LoRA training. The changes are extensive, touching upon model definitions, argument parsing, data collation, training loops, and weight conversion utilities to accommodate multimodal inputs. New documentation and example scripts are provided to guide users. The implementation includes patches for m-RoPE and updates to handle visual components like ViT and aligners. The code appears well-structured and the changes are consistent across the codebase. I've identified a couple of minor typos in the new documentation files that should be corrected for clarity.

gemini-code-assist · 2025-09-02T12:23:42Z

docs/source/Megatron-SWIFT/多模态模型.md

There appears to be a typo in the model path. The model being trained is Qwen2.5-VL-7B-Instruct, so the path here should likely be megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx-hf to match the output directory from the export step.

Suggested change

--model megatron_output/Qwen2.5-7B-Instruct/vx-xxx-hf \

--model megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx-hf \

gemini-code-assist · 2025-09-02T12:23:42Z

docs/source_en/Megatron-SWIFT/Multimodal-Model.md

+MAX_PIXELS=1003520 \
+CUDA_VISIBLE_DEVICES=0 \
+swift infer \
+    --model megatron_output/Qwen2.5-7B-Instruct/vx-xxx-hf \


There appears to be a typo in the model path. The model being trained is Qwen2.5-VL-7B-Instruct, so the path here should likely be megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx-hf to match the output directory from the export step.

Suggested change

--model megatron_output/Qwen2.5-7B-Instruct/vx-xxx-hf \

--model megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx-hf \

update

abba497

gemini-code-assist bot reviewed Aug 22, 2025

View reviewed changes

Jintao-Huang added 7 commits August 24, 2025 20:27

update

0883b84

update

bdbaa9a

Merge branch 'main' into support_megatron_multimodal

efd6f72

Merge branch 'main' into support_megatron_multimodal

c0b28b4

update

0e60545

update

b349b56

update

6a77b0e

Jintao-Huang changed the title ~~[WIP] [megatron] support multimodal model~~ [megatron] support multimodal model Aug 31, 2025

Jintao-Huang added 2 commits August 31, 2025 23:44

Merge branch 'main' into support_megatron_multimodal

26d5f64

update

a502bbb

gemini-code-assist bot reviewed Aug 31, 2025

View reviewed changes

Merge branch 'main' into support_megatron_multimodal

2e92219

modelscope deleted a comment from gemini-code-assist bot Sep 1, 2025

Jintao-Huang added 7 commits September 1, 2025 14:15

fix

6fad478

update

2878d15

update

0cbada0

fix

3151603

update

93c4693

update

308d565

update

08320eb

Jintao-Huang changed the title ~~[megatron] support multimodal model~~ [megatron] support multimodal model SFT (Full/LoRA) Sep 1, 2025

gemini-code-assist bot reviewed Sep 1, 2025

View reviewed changes

Jintao-Huang added 2 commits September 1, 2025 23:42

fix

44a95bd

lint pass

50b2eb1

Jintao-Huang added 3 commits September 2, 2025 01:27

fix cp

51f315a

update

ef66901

lint pass

631691c

hjh0119 approved these changes Sep 2, 2025

View reviewed changes

Jintao-Huang added 4 commits September 2, 2025 17:21

update

587298e

update

a1cb64b

fix

798bdd4

fix

5953506

lint pass

0508aaf

gemini-code-assist bot reviewed Sep 2, 2025

View reviewed changes

Jintao-Huang added 4 commits September 2, 2025 18:14

update

6a431aa

update

3f35a86

update

0156a0c

fix

4c810cb

tastelikefeet approved these changes Sep 2, 2025

View reviewed changes

gemini-code-assist bot reviewed Sep 2, 2025

View reviewed changes

fix

d90d282

hjh0119 approved these changes Sep 2, 2025

View reviewed changes

Jintao-Huang changed the title ~~[megatron] support multimodal model SFT (Full/LoRA)~~ [megatron] support multimodal model SFT/DPO (Full/LoRA) Sep 2, 2025

Jintao-Huang changed the title ~~[megatron] support multimodal model SFT/DPO (Full/LoRA)~~ [megatron] support multimodal model CPT/SFT/DPO (Full/LoRA) Sep 2, 2025

Jintao-Huang added 2 commits September 3, 2025 00:26

update

f89a2bb

update

dc803a0

Jintao-Huang merged commit a161d54 into modelscope:main Sep 3, 2025
2 checks passed

	model, _ = get_model_tokenizer(model_dir, args.torch_dtype, return_dummy_model=True, **kwargs)
	model, self.processor = get_model_tokenizer(model_dir, args.torch_dtype, return_dummy_model=True, **kwargs)

	elif self.position_embedding_type in 'mrope':
	elif self.position_embedding_type == 'mrope':

		raise NotImplementedError('Flash decoding uses precomputed cos and sin for RoPE, not implmented in '
		'MultimodalRotaryEmbedding yet.')

	self.model = model.visual.to('cuda')
	self.model = model.visual.to(torch.cuda.current_device())

	--model megatron_output/Qwen2.5-7B-Instruct/vx-xxx-hf \
	--model megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx-hf \

[megatron] support multimodal model CPT/SFT/DPO (Full/LoRA) #5502

[megatron] support multimodal model CPT/SFT/DPO (Full/LoRA) #5502

Uh oh!

Conversation

Jintao-Huang commented Aug 22, 2025

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Jintao-Huang commented Aug 31, 2025

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Aug 31, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Aug 31, 2025

Choose a reason for hiding this comment

Uh oh!

Jintao-Huang commented Sep 1, 2025

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Sep 1, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Sep 1, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Sep 1, 2025

Choose a reason for hiding this comment

Uh oh!

Jintao-Huang commented Sep 2, 2025

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Sep 2, 2025

Choose a reason for hiding this comment

Uh oh!

Jintao-Huang commented Sep 2, 2025

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Sep 2, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Sep 2, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Sep 2, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Sep 2, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone